Practical 1 analyzes the water levels of the Aare, the longest river entirely within Switzerland (295 km; it rises in the Bernese Alps and ends at its junction with the Rhine), at a measuring station in Untersiggenthal next to the Paul Scherrer Institute, where nuclear reactors are located, making this location very sensitive in case of flood damage.
The next table displays the first six rows of the niveau data set. The data set collects the Aare’s daily maximum water levels at a single station in Stilli, Untersiggenthal (canton of Aargau) and records the exact times at which the daily maxima were detected.
| Stationsname | Stationsnummer | Parameter | Zeitreihe | Parametereinheit | Gewässer | Zeitstempel | Zeitpunkt_des_Auftretens | Wert | Freigabestatus |
|---|---|---|---|---|---|---|---|---|---|
| Untersiggenthal, Stilli | 2205 | Pegel | Tagesmaxima | m ü.M. | Aare | 2000-01-01 00:00:00 | 2000-01-01 00:23:00 | 326.245 | Freigegeben, validierte Daten |
| Untersiggenthal, Stilli | 2205 | Pegel | Tagesmaxima | m ü.M. | Aare | 2000-01-02 00:00:00 | 2000-01-02 00:43:10 | 326.153 | Freigegeben, validierte Daten |
| Untersiggenthal, Stilli | 2205 | Pegel | Tagesmaxima | m ü.M. | Aare | 2000-01-03 00:00:00 | 2000-01-03 00:00:00 | 326.053 | Freigegeben, validierte Daten |
| Untersiggenthal, Stilli | 2205 | Pegel | Tagesmaxima | m ü.M. | Aare | 2000-01-04 00:00:00 | 2000-01-04 01:43:40 | 325.871 | Freigegeben, validierte Daten |
| Untersiggenthal, Stilli | 2205 | Pegel | Tagesmaxima | m ü.M. | Aare | 2000-01-05 00:00:00 | 2000-01-05 21:23:00 | 325.837 | Freigegeben, validierte Daten |
| Untersiggenthal, Stilli | 2205 | Pegel | Tagesmaxima | m ü.M. | Aare | 2000-01-06 00:00:00 | 2000-01-06 03:33:05 | 325.835 | Freigegeben, validierte Daten |
The below plot shows the evolution of the daily water maxima over roughly 21 years, from 01-01-2000 to 01-08-2021. We see that the lowest troughs are quite constant over the years (~325.5 metres above sea level), whereas the highest peaks clearly stand out from the rest. Indeed, peaks are mainly observed in the summer months, with the highest recorded on 25-08-2005 at 328.827 m.a.s.l., on 09-06-2007 at 329.323 m.a.s.l. and on 14-07-2021 at 328.622 m.a.s.l. Therefore, over 21 years, three years (2005, 2007 and 2021) had extremely and unusually high water levels during the summer.
We also observe a seasonal pattern in the maxima, highlighted by the green smoothed curve, which indicates that water levels fluctuate systematically within a year (e.g. heavier rain in November and April).
The following output shows the summary statistics of the water level values. The summary shows that the values are not symmetric around the mean or the median: they are much more tightly grouped below the mean (and the median) than above it, where they are more dispersed, indicating that the distribution of the water level maxima is right-skewed.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 325.4 325.6 325.8 325.9 326.1 329.3
The below histogram (i.e. frequency distribution) confirms the above observations. Indeed, we see that the data is right-skewed (and therefore non-normal): the distribution has a long right tail where some daily maxima take much higher values than usual (i.e. outliers) that need to be investigated.
The blue curve represents the smoothed version of the histogram, which reflects the probability density function of the values more accurately. The dark green vertical line marks the median water level, 325.8 metres above sea level, and the light green vertical line marks the mean, 325.9.
Looking at the positions of the mean and the median on the x-axis relative to the distribution, we see that the mean is pulled towards the right tail, so the median seems to be the better indicator of the center of the distribution.
The objective behind the EVA is to find valid estimates of the frequency (expressed as a return period) of extreme water level maxima. Extreme values are usually infrequent but have a large impact on whatever is the focus of interest (here, flood risk), which is why they represent the tail of the distribution.
The Peak-Over-Threshold (POT) method provides a model for independent exceedances (i.e. extreme values) above a large threshold. It identifies the values that lie above a designated threshold \(u\), which we consider to be the extremes (i.e. highest values). In order to determine an optimal threshold, we apply the MRL plot and then look at the distribution of the exceedances. The value of \(u\) from which the plot becomes approximately linear can generally be selected as the optimal threshold.
Therefore, to model the high water levels, we first produce an MRL plot to choose the optimal threshold \(u\); we then use the Peak-Over-Threshold method and look at the distribution of the exceedances to assess their probability of occurrence.
Clusters of the extremes correspond to the clustering of the data points that are above the chosen threshold \(u\).
To measure and deal with clustering of the daily water levels, we consider consecutive threshold exceedances to belong to the same cluster. Using the POT approach, we can observe the different clusters of extreme values on the Peak-Over-Threshold plot below. We can then fit the Generalized Pareto Distribution (GPD) or a Point Process model to the cluster maxima (after declustering if the exceedances exhibit autocorrelation).
First, we select an appropriate threshold \(u\) for fitting the GPD or Point Process models. To do so, we use the mrlplot() function from the extRemes package, which plots potential thresholds against the mean excess.
The mean excess function \(e(u) = E(X - u \mid X > u)\) is the expected value of the excess (i.e. the difference between the exceedance and the threshold \(u\)), given that the observation exceeds the threshold.
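For intuition, the empirical version of \(e(u)\) that an MRL plot traces can be sketched in a few lines. This is an illustrative Python sketch, not the extRemes code, and the water levels below are made-up values, not the niveau data:

```python
import numpy as np

def mean_excess(x, u):
    """Empirical mean excess e(u) = E(X - u | X > u):
    the average distance above u among observations exceeding u."""
    exc = x[x > u]
    if exc.size == 0:
        return np.nan
    return float(np.mean(exc - u))

# Hypothetical water levels, swept over candidate thresholds
# the way mrlplot() does internally.
levels = np.array([325.5, 325.7, 326.0, 326.4, 327.1, 328.8, 329.3])
for u in np.linspace(levels.min(), levels.max() - 0.1, 5):
    print(round(u, 2), round(mean_excess(levels, u), 3))
```

Plotting these pairs (threshold, mean excess) gives the MRL curve whose approximately linear region guides the threshold choice.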
We are interested in the distribution of the daily water level maxima above the threshold \(u\) (i.e. the exceedances), given by the excess distribution function \(F_u(x) = P(X - u \leq x \mid X > u)\).
The difficulty of choosing an adequate threshold \(u\) lies in striking a balance: the threshold must be low enough to leave enough exceedances for precise results, but high enough for the GPD and Point Process approximations to be approximately valid (i.e. low bias).
Note: if the threshold is so low that most of the data are exceedances, then the analysis is only valid if the raw data themselves come from a generalized Pareto distribution, which will rarely be the case.
The below Mean Residual Life plot indicates the mean excess for each potential threshold \(u\). As mentioned previously, the value of \(u\) from which the plot becomes approximately linear can generally be selected as the optimal threshold. The dashed curves represent the confidence intervals of the mean excesses, which help assess its type (i.e. constant, linear). Therefore, we choose a threshold \(u = 327\ [m.a.s.l]\), from which the curve seems to become linear.
From now on, daily water level maxima above \(327\ [m.a.s.l]\) are considered extreme values.
Therefore, below \(327\ [m.a.s.l]\) a threshold would be too low for the assumptions of the extreme value approach to be valid, and above \(327\ [m.a.s.l]\) it would be too high to obtain insightful results because of too few data.
Since the mean excess function is used to validate a GPD model for the excess distribution, and since the mean excess function would roughly be a straight line above the threshold if we zoomed out on the y-axis, we can say that the GPD is a reasonable fit for modelling the exceedances.
The below plot shows the results of the Peaks-Over-Threshold method. The red line represents the designated threshold \(u = 327\ [m.a.s.l]\), and the red dots are the extreme daily water maxima recorded and considered as exceedances.
Under the POT approach, the observations above the high threshold \(u\) (i.e. the red observations above 327 m.a.s.l.) are not independent in absolute terms. However, exceedances that are not consecutive above \(u\) belong to different clusters, and clusters are considered independent only if they are separated by at least a certain number of below-threshold observations (the run length).
To avoid dependence between exceedances, we decluster the observations above the threshold \(u\): the declustering keeps only the highest water level maximum of each cluster as its key representative observation. The maxima identified within each cluster are then fitted to the GPD.
The following plot shows that some exceedances above \(327\ [m.a.s.l]\) become more transparent while others remain fully visible. The visible points are the 79 observations (one per cluster) kept after declustering.
##
## niveau$Wert declustered via runs declustering.
##
## Estimated extremal index (intervals estimate) = 0.6758134
##
## Number of clusters = 79
##
## Run length = 1
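Runs declustering of the kind reported above (run length 1) can be sketched as follows; this is an illustrative Python re-implementation of the idea, not the extRemes code:

```python
def decluster_runs(x, threshold, run_length=1):
    """Runs declustering: exceedances separated by at least `run_length`
    consecutive below-threshold observations belong to different clusters;
    keep only the maximum of each cluster."""
    maxima = []
    gap = run_length          # start "outside" any cluster
    current = None            # running maximum of the open cluster
    for v in x:
        if v > threshold:
            current = v if current is None else max(current, v)
            gap = 0
        else:
            gap += 1
            if current is not None and gap >= run_length:
                maxima.append(current)   # gap long enough: close the cluster
                current = None
    if current is not None:
        maxima.append(current)           # close a cluster running to the end
    return maxima

print(decluster_runs([1, 5, 6, 1, 7, 1], threshold=4))  # [6, 7]
```

With run length 1, a single below-threshold observation is enough to separate two clusters, which matches the output above.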
The GPD is used to describe the statistical properties of excesses. Indeed, Extreme Value Theory suggests that the GPD is a natural approximation of the excess distribution function \(F_u(x)\), where \(x\) is the excess, so \(F_u(x) \approx G_{\xi, \beta}(x)\) for some shape parameter \(\xi\) and scale parameter \(\beta > 0\).
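As a reference point for \(G_{\xi,\beta}\), its distribution function, \(G_{\xi,\beta}(x) = 1 - (1 + \xi x/\beta)^{-1/\xi}\) for \(\xi \neq 0\), can be written out directly (a self-contained Python sketch, not tied to our fitted model):

```python
import math

def gpd_cdf(x, xi, beta):
    """CDF of the generalized Pareto distribution G_{xi,beta}(x) for x >= 0,
    with scale beta > 0; the xi = 0 case reduces to the exponential CDF."""
    if x < 0:
        return 0.0
    if abs(xi) < 1e-12:                 # xi -> 0 limit
        return 1.0 - math.exp(-x / beta)
    z = 1.0 + xi * x / beta
    if z <= 0:                          # beyond the upper endpoint when xi < 0
        return 1.0
    return 1.0 - z ** (-1.0 / xi)
```

Note that \(\xi > 0\) gives a heavy (polynomial) tail, \(\xi = 0\) an exponential tail, and \(\xi < 0\) a finite upper endpoint at \(-\beta/\xi\).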
Under the assumptions that the exceedances are independent (i.e. the occurrence of one exceedance does not affect the probability of another), identically distributed (i.e. drawn from the same distribution) and that the threshold is high enough for the asymptotic GPD approximation to hold, we fit the GPD model to the daily water level exceedances.
The below output shows the return level estimates for 50-year and 100-year events.
## extRemes::fevd(x = unlist(decluster), threshold = threshold,
## type = "GP", time.units = "365/year")
## get(paste("return.level.fevd.", newcl, sep = ""))(x = x, return.period = return.period)
##
## GP model fitted to unlist(decluster)
## Data are assumed to be stationary
## [1] "Return Levels for period units in years"
## 50-year level 100-year level
## 329.1207 329.3560
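The return levels above come from inverting the fitted GPD tail. A hedged sketch of that computation follows; the formula is the standard GPD return level, but the sigma and xi values below are hypothetical illustrations, not the fitted parameters:

```python
import math

def gpd_return_level(m, u, sigma, xi, rate_per_year):
    """m-year return level of a GPD fitted above threshold u:
    x_m = u + (sigma/xi) * ((m * rate_per_year)**xi - 1)  for xi != 0,
    where rate_per_year is the expected number of (declustered)
    exceedances per year."""
    n = m * rate_per_year            # expected number of exceedances in m years
    if abs(xi) < 1e-12:              # xi -> 0 limit: exponential tail
        return u + sigma * math.log(n)
    return u + sigma / xi * (n ** xi - 1.0)

# Hypothetical sigma and xi for illustration; the rate uses the
# 79 declustered clusters over roughly 21 years reported above.
rl50 = gpd_return_level(50, u=327.0, sigma=0.5, xi=-0.1, rate_per_year=79 / 21)
rl100 = gpd_return_level(100, u=327.0, sigma=0.5, xi=-0.1, rate_per_year=79 / 21)
```

By construction the 100-year level always exceeds the 50-year level, consistent with the output above.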
The below plot shows the return level estimates by return period, and we see that they match the values above.
The Block Maxima method gives more efficient results when fewer exceedances are observed. It may also be preferable when the observations are not exactly independent and identically distributed, for example when there is seasonal periodicity in yearly maxima, or short-range dependence that plays a role within blocks but not between blocks.
It is less flexible than POT since in practice it might be hard to change the block size while it is easy in POT to move the threshold continuously.
Below is displayed the candlestick plot of the four stock prices. Since Amazon has a much higher stock price (~$1,000) than the others, it takes the most prominent place on the plot. You can use the zoom to see the detail of the other three stocks' price evolution over time.
For all four stocks, it seems that the mean of the log-returns is 0 or close to 0 and that they are stationary. AAPL and AMZN become more volatile in 2019 and their shapes seem to be skewed. NVDA and TSLA seem to be more volatile and their shapes seem to indicate they follow a normal distribution.
# Peak-Over-Threshold
We choose a threshold of 0.05 based on the mean residual life plot. We then fit a peak-over-threshold model and display the uncertainty of its estimates.
## [1] "uncertainty"
## rlevel shape
## 0.1134840 0.2487138
We now fit a GPD to the upper tail of our stock data beyond the chosen threshold of 0.05.
## $threshold
## [1] 0.05
##
## $nexc
## [1] 32
##
## $conv
## [1] 0
##
## $nllh
## [1] -74.3385
##
## $mle
## [1] 0.01965532 0.60647468
##
## $rate
## [1] 0.04238411
##
## $se
## [1] 0.00558524 0.24858608
By using the POT approach, we built a model for high values of the negative log-returns, obtaining Maximum Likelihood Estimates for the scale (sigma) and shape (xi) parameters of 0.01965532 and 0.60647468, with standard errors of 0.00558524 and 0.24858608.
Now we compute the 95% value at risk and plot this value on a histogram.
## 5%
## -0.04455341
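The 95% VaR reported above is the 5% empirical quantile of the return distribution. A minimal sketch on synthetic data (the numbers below are simulated, not our stock returns):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic daily log-returns standing in for a stock series.
log_returns = rng.normal(loc=0.0005, scale=0.02, size=1000)

# 95% historical VaR: the 5% empirical quantile of the returns,
# i.e. a loss level exceeded on roughly 5% of days.
var_95 = np.quantile(log_returns, 0.05)
print(f"95% VaR: {var_95:.4f}")
```

On real data, the same one-liner applied to the log-return vector reproduces the quantile shown above.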
Since we do not see a trend, we understand that the time series are stationary (i.e. their statistical properties do not depend on the time of the observation).
From the plot below, we see that many of the high loads and peaks came from searches on members of the British Royal Family such as Queen Victoria and Elizabeth II. These peaks are not linked to any seasonal trend and are instead due to independent events. For instance, the peaks related to the royal family members increase massively as of 2016. One potential reason could be the arrival of the Netflix TV series “The Crown”, which debuted in 2016, details the lives of the royals, and is watched by many people, hence such high searches. Moreover, with members such as Elizabeth II and Prince Philip being important and rather elderly figures, there have often been health scares which could also have sparked high searches regarding their well-being. We can also see that Summer Olympics searches are high around the time of the event or in the build-up to it. However, once the Olympic Games ended, the searches fell drastically, as expected.
We have made a bar plot depicting the frequency of views and searches on a given topic/type.
The bar plot shows the distribution of the total daily count across the different time series. We see that the biggest frequency is generated by web searches across all the topics that we have. The black line is the median of the total daily count. Most of the counts fall below this median line, which indicates that the total daily count is not normally distributed but right-skewed, meaning the tail of the distribution displays extreme values on the right-hand side of the median.
\[ R_i = \frac{X_i - X_{i-7}}{X_{i-7}} \times 100 \] \(R_i\) is the relative weekly change (increase or decrease) in percentage.
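The transformation above can be sketched directly from the formula (the counts below are illustrative, not the Wikipedia data):

```python
def weekly_change(x):
    """Relative weekly change in percent: R_i = (X_i - X_{i-7}) / X_{i-7} * 100.
    The first seven observations have no week-earlier counterpart."""
    return [(x[i] - x[i - 7]) / x[i - 7] * 100 for i in range(7, len(x))]

counts = [100, 100, 100, 100, 100, 100, 100, 110, 121]
print(weekly_change(counts))  # [10.0, 21.0]
```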
When plotting the relative weekly change we see the highs and lows more clearly. There is a massive high load in the traffic count in April 2016 for Princess Margaret and another, smaller peak for Winston Churchill in February 2016.
For each time series (\(R_i,\ i = 1, \ldots, 12\)), we plot their daily count distribution on a Q-Q plot to assess whether the data come from a theoretical distribution (e.g. normal). The Q-Q plot draws the sample quantiles against the quantiles of the normal distribution and thus visually checks the normality assumption. If the assumption does not hold, we look into which data points contribute to the violation.
Results: on all the Q-Q plots below, we observe that the daily count distributions follow a normal distribution in the bulk but depart from it in the tail, indicating that the data are skewed. Indeed, they are right-skewed (i.e. positively skewed).
Knowing that the data are right-skewed, we generate each time series' Mean Residual Life plot using the mrlplot() function from the extRemes package, which plots potential thresholds against the mean excess. The following plots are used to choose the most adequate thresholds: the value of \(u\) from which the plot becomes approximately linear can generally be selected as the optimal threshold.
After setting their theoretical optimal threshold, we draw their Peaks-Over-Threshold plot with the threshold \(u\) to display the exceedances. Note that if a threshold is too low, the extreme value approach is not valid, and if it is too high, we cannot obtain insightful results because of too few data.
For the following modified time series where the POT is not suitable, we suggest using a Block Maxima approach.
The selected threshold \(u\) is 125.
The chosen threshold indicates that there are neither too few nor too many data points exceeding the threshold. Therefore, we assume that using the POT model is suitable.
The Empirical Quantiles plot shows that the residuals of the modified daily counts follow the GPD reasonably well: they track the regression line closely in the bulk of the distribution, although they move away from it in the tail. The GPD model therefore seems broadly appropriate for this time series. The Density plot shows the overall shape of the residual distribution; here it is bimodal.
The selected threshold \(u\) is 35.
The chosen threshold indicates that there are few data points exceeding the threshold. Therefore, we believe that using the POT model is maybe not suitable.
The Empirical Quantiles plot shows that the residuals of the modified daily counts follow the GPD reasonably well: they track the regression line quite closely in the bulk of the distribution, although they move away from it in the tail. The GPD model therefore seems broadly appropriate for this time series. The Density plot shows the overall shape of the residual distribution; here it is bimodal.
The selected threshold \(u\) is 100.
The chosen threshold indicates that there are not too few nor too many data points exceeding the threshold. Therefore, we assume that using the POT model is suitable.
The Empirical Quantiles plot shows that the residuals of the modified daily counts follow the GPD reasonably well: they track the regression line closely in the bulk of the distribution, although they move away from it in the tail. The GPD model therefore seems broadly appropriate for this time series. The Density plot shows the overall shape of the residual distribution; here it seems to be bimodal.
The selected threshold \(u\) is 30.
The chosen threshold indicates that there are not too few nor too many data points exceeding the threshold. Therefore, we assume that using the POT model is suitable.
The Empirical Quantiles plot shows that the residuals of the modified daily counts follow the GPD reasonably well: they track the regression line closely in the bulk of the distribution, although they move away from it in the tail. The GPD model therefore seems broadly appropriate for this time series. The Density plot shows the overall shape of the distribution; here the data seem right-skewed.
The selected threshold \(u\) is 25.
The chosen threshold indicates that there are not too few nor too many data points exceeding the threshold. Therefore, we assume that using the POT model is suitable.
The Empirical Quantiles plot shows that the residuals of the modified daily counts follow the GPD reasonably well: they track the regression line closely in the bulk of the distribution, although they move away from it in the tail. The GPD model therefore seems broadly appropriate for this time series. The Density plot shows the overall shape of the residual distribution; here the data seem right-skewed.
The selected threshold \(u\) is 200.
The chosen threshold indicates that there are too few data points exceeding the threshold. Therefore, we believe that using the POT model is maybe not suitable.
The Empirical Quantiles plot shows that the residuals of the modified daily counts follow the GPD reasonably well: they track the regression line closely in the bulk of the distribution, although they move away from it in the tail. The GPD model therefore seems broadly appropriate for this time series. The Density plot shows the overall shape of the residual distribution; here it seems to be bimodal.
The selected threshold \(u\) is 50.
The chosen threshold indicates that there are too few data points exceeding the threshold. Therefore, we believe that using the POT model is maybe not suitable.
The Empirical Quantiles plot shows that the residuals of the modified daily counts follow the GPD reasonably well: they track the regression line closely in the bulk of the distribution, although they move away from it in the tail. The GPD model therefore seems broadly appropriate for this time series. The Density plot shows the overall shape of the residual distribution; here the data seem right-skewed.
The selected threshold \(u\) is 700.
The chosen threshold indicates that there are too few data points exceeding the threshold. Therefore, we believe that using the POT model is maybe not suitable.
The Empirical Quantiles plot shows that the residuals of the modified daily counts follow the GPD closely, tracking the regression line both in the bulk of the distribution and in the tail. The GPD model therefore seems appropriate for this time series. The Density plot shows the overall shape of the residual distribution; here the data seem right-skewed.
The selected threshold \(u\) is 400.
The chosen threshold indicates that there are too few data points exceeding the threshold. Therefore, we believe that using the POT model is maybe not suitable.
The selected threshold \(u\) is 100.
The chosen threshold indicates that there are too few data points exceeding the threshold. Therefore, we believe that using the POT model is maybe not suitable.
The selected threshold \(u\) is 70.
The chosen threshold indicates that there are neither too few nor too many data points exceeding the threshold. Therefore, we assume that using the POT model is suitable.
The Empirical Quantiles plot shows that the residuals of the modified daily counts follow the GPD reasonably well: they track the regression line closely in the bulk of the distribution, although they move away from it in the tail. The GPD model therefore seems broadly appropriate for this time series. The Density plot shows the overall shape of the residual distribution; here it seems to be right-skewed.
The selected threshold \(u\) is 60.
The chosen threshold indicates that there are too few data points exceeding the threshold. Therefore, we believe that using the POT model is maybe not suitable.
The Empirical Quantiles plot shows that the residuals of the modified daily counts follow the GPD reasonably well: they track the regression line closely in the bulk of the distribution, although they move away from it in the tail. The GPD model therefore seems broadly appropriate for this time series. The Density plot shows the overall shape of the residual distribution; here the data seem right-skewed.
The following output shows the 99%-quantile and uncertainty measures of the modified daily count for each time series. To generate these results we use the quantile function of the stats package and the standard error of the peaks-over-threshold model.
The graphical method that we suggest for detecting simultaneous high traffic loads is an interactive Block Maxima plot coloured by topic. This allows us to visually compare the daily maxima across the 12 different web pages. If one or several days display a high maximum for one page together with values from other web pages just below it, this indicates a simultaneous high load.
By pointing the mouse cursor over data points, more information becomes available. With this plot, we can observe that Princess Margaret, George VI, and Prince Philip of Edinburgh have the largest simultaneous high loads around 6 November 2016, and another, smaller simultaneous high load on 21 April 2016. So the idea for Wikipedia is that if the load of one of these web pages is increasing, it should cache the other ones to prevent exhaustion of the available resources.
Another alternative method would be to generate scatter plots to visualize how two distributions move together.
For the numerical method, we compute a matrix of the tail dependence coefficients between the different web pages. We are interested in the values closest to 1: they indicate a high likelihood that if an extreme value arises in the upper tail for one web page, the other web page is very likely to also display an extreme value in its upper tail. Then there is tail dependence, and a simultaneous high load is likely for both (or more than two) pages. In this case, Wikipedia should anticipate it and use the caching system on them.
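The FRAPO tdc() function uses EVT-based estimators; a simpler finite-sample proxy for the upper tail dependence coefficient, which just counts joint quantile exceedances, can be sketched as:

```python
import numpy as np

def upper_tdc(x, y, q=0.90):
    """Nonparametric proxy for the upper tail dependence coefficient:
    P(Y above its q-quantile | X above its q-quantile)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    qx, qy = np.quantile(x, q), np.quantile(y, q)
    above_x = x > qx
    if above_x.sum() == 0:
        return 0.0
    return float(np.mean(y[above_x] > qy))
```

Values near 1 mean the two series tend to exceed their high quantiles on the same days (simultaneous high loads); values near 0 mean extreme days rarely coincide.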
Here, we use the tdc function from the FRAPO package.
| 2016_Summer_Olympics | Diana,_Princess_of_Wales | Elizabeth_II | George_VI | Prince_Philip,_Duke_of_Edinburgh | Princess_Margaret,_Countess_of_Snowdon | Queen_Victoria | United_Kingdom | United_States | Winston_Churchill | World_War_I | World_War_II | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2016_Summer_Olympics | 1.000 | 0.071 | 0.000 | 0.071 | 0.036 | 0.107 | 0.036 | 0.107 | 0.214 | 0.036 | 0.000 | 0.036 |
| Diana,_Princess_of_Wales | 0.071 | 1.000 | 0.071 | 0.071 | 0.143 | 0.071 | 0.071 | 0.036 | 0.036 | 0.000 | 0.000 | 0.036 |
| Elizabeth_II | 0.000 | 0.071 | 1.000 | 0.464 | 0.607 | 0.464 | 0.286 | 0.107 | 0.000 | 0.036 | 0.143 | 0.000 |
| George_VI | 0.071 | 0.071 | 0.464 | 1.000 | 0.536 | 0.571 | 0.321 | 0.071 | 0.036 | 0.071 | 0.107 | 0.036 |
| Prince_Philip,_Duke_of_Edinburgh | 0.036 | 0.143 | 0.607 | 0.536 | 1.000 | 0.536 | 0.214 | 0.071 | 0.000 | 0.036 | 0.107 | 0.036 |
| Princess_Margaret,_Countess_of_Snowdon | 0.107 | 0.071 | 0.464 | 0.571 | 0.536 | 1.000 | 0.214 | 0.071 | 0.071 | 0.000 | 0.107 | 0.036 |
| Queen_Victoria | 0.036 | 0.071 | 0.286 | 0.321 | 0.214 | 0.214 | 1.000 | 0.036 | 0.000 | 0.000 | 0.107 | 0.000 |
| United_Kingdom | 0.107 | 0.036 | 0.107 | 0.071 | 0.071 | 0.071 | 0.036 | 1.000 | 0.036 | 0.179 | 0.107 | 0.107 |
| United_States | 0.214 | 0.036 | 0.000 | 0.036 | 0.000 | 0.071 | 0.000 | 0.036 | 1.000 | 0.071 | 0.179 | 0.179 |
| Winston_Churchill | 0.036 | 0.000 | 0.036 | 0.071 | 0.036 | 0.000 | 0.000 | 0.179 | 0.071 | 1.000 | 0.107 | 0.071 |
| World_War_I | 0.000 | 0.000 | 0.143 | 0.107 | 0.107 | 0.107 | 0.107 | 0.107 | 0.179 | 0.107 | 1.000 | 0.429 |
| World_War_II | 0.036 | 0.036 | 0.000 | 0.036 | 0.036 | 0.036 | 0.000 | 0.107 | 0.179 | 0.071 | 0.429 | 1.000 |